Machine vision research began with single-camera systems, but these
systems were limited by having only one point of view of the
environment and no depth information; stereo cameras were
developed to overcome this. This paper proposes a hybrid stereo matching
algorithm with the goal of generating an accurate disparity map, which is critical for
applications such as 3D surface reconstruction and robot navigation.
A convolutional neural network (CNN) is utilised to generate the
matching cost, which is then passed to cost aggregation, where a
bilateral filter (BF) increases accuracy. Winner-take-all (WTA) is used to
generate the preliminary disparity map. To eliminate the remaining artefacts,
an edge-preserving filter (EPF) is applied to that output. The EPF is based on
a transform that defines an isometry between curves on the 2D image manifold
in 5D and the real line. The transform warps the input signal adaptively to allow
linear-time 1D filtering. Because the filter is robust to high contrast and brightness, it is
effective in refining and removing noise from the output image. Based on
experiments on the standard Middlebury validation
benchmark, this approach gives high accuracy, with an average non-occluded
error of 6.71%, comparable to other published methods.
1. INTRODUCTION


Machine vision research started with a single camera. Because of its single point of view, such a system had many
limitations; hence stereo cameras were developed to reconstruct a 3D environment and overcome the shortcomings of a
single camera without depth information. Vision and LiDAR are the main approaches for obtaining depth data for
terrain or surface reconstruction. LiDAR scans the world and creates distinct 3D surfaces. Despite its
accuracy and wide field of view, its hardware and interfaces are hard to provide and it uses a lot of power; hence
stereo vision is used as an alternative. The overlap between the visual fields of the left and right eyes enables
humans to perceive depth when viewing objects with two eyes. Stereo vision algorithms replicate
this complex visual process by modelling it with a number of mathematical approaches. Research on stereo
vision focuses mostly on stereo matching, which is sometimes referred to as the stereo correspondence
problem [1], where incorrect matches between images from several sensors obscure the actual matches. Stereo
matching finds the pixels in the 2D stereo images that correspond to the same point in the 3D scene. Normalised
epipolar geometry simplifies finding correspondences on the same epipolar line, producing a disparity map as
the output. Stereo vision allows robots to estimate distance, plan collision-free paths, grasp objects with
precision, and operate independently in dynamic environments [2]-[5]. Self-driving cars can determine the
relative position of lane markings and make accurate decisions for trajectory planning, ensuring smooth and
safe navigation [6]-[9]. Other applications are 3D face recognition [10], surface reconstruction [11]-[14],
robotic surgery [15]-[19], virtual reality [20]-[23], and augmented reality [24]. Numerous research
publications have appeared on this topic, and substantial progress has been achieved. The number of
journal papers and books published by ScienceDirect in the domain of stereo vision between 2000 and 2022
is presented in Figure 1. Each year, new methods are derived, with a focus on i) accuracy and ii) time
consumption. When deciding on a stereo method, these criteria should be carefully evaluated.

Figure 1. The increasing number of articles in journals or books published by ScienceDirect corresponding to
the search term "stereo vision" from 2000 until 2022 


The epipolar line is the result of projecting the ray from the optical centre of the left camera through a pixel onto the image
plane of the right image [25]. This line denotes the possible corresponding locations in the right image for a pixel in the
left image. The locations and orientations of the cameras, together with other characteristics such as focal length
and field of view, determine the orientations and positions of the epipolar lines. Rectifying the images
so that the epipolar lines are aligned restricts the disparity candidates to a single horizontal axis [26], [27]. Common image rectification
techniques may be broken down into three distinct transformation categories: projective, affine, and shearing.
A stereo correspondence method is then used to identify the pixels in the stereo images that match.
Using the triangulation principle shown in (1), the depth information is finally retrieved:


$$ z = \frac{f \cdot b}{x_L - x_R} = \frac{f \cdot b}{d} \qquad (1) $$


where z is the depth, f is the camera focal length, b is the baseline (the distance between the cameras' optical
centres), and d = x_L - x_R is the disparity. In a given stereo arrangement with constant f and b, the disparity search
range [d_min, d_max] limits the recoverable depth range. Erroneous pixels and other ambiguities in the input images have a
direct influence on the quality of the output map. As illustrated in Figure 2, Scharstein and Szeliski [1] first
proposed a common framework for stereo vision algorithms, organised as a sequential
multi-stage pipeline. The framework receives the input image pair from a stereo camera that acts as the stereo
sensor and relies on the assumption that the image pair has been rectified. In the first
stage, the cost function is computed to determine the degree of similarity between patches in the input image
pair, and the process is repeated for every pixel in the left image. In the second stage, cost aggregation applies
filters to remove noise from the raw costs. In the third stage, the winner-take-all (WTA) strategy keeps only
the disparity with the lowest cost and discards all other disparity candidates. Finally, disparity refinement
applies a low-pass filter to the resulting map.
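
As a small illustration of the triangulation in (1), the sketch below converts a disparity to a depth and maps a disparity search range to the depth interval it covers. The function names, the choice of pixels for f and d, and metres for b are assumptions made for this example only.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth z = f * b / d from Eq. (1); None when d <= 0 (no valid match)."""
    if disparity_px <= 0:
        return None
    return focal_length_px * baseline_m / disparity_px


def depth_range(d_min, d_max, focal_length_px, baseline_m):
    """Depth interval (nearest, farthest) covered by the range [d_min, d_max]."""
    nearest = depth_from_disparity(d_max, focal_length_px, baseline_m)
    farthest = depth_from_disparity(d_min, focal_length_px, baseline_m)
    return nearest, farthest


# Hypothetical rig: f = 1000 px, b = 0.25 m.
print(depth_from_disparity(50, 1000.0, 0.25))   # 5.0 (metres)
print(depth_range(10, 100, 1000.0, 0.25))       # (2.5, 25.0)
```

Because depth is inversely proportional to disparity, the largest disparity corresponds to the nearest surface and the smallest to the farthest one.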

Stereo matching methods can be categorised as global or local optimisation strategies based on how
the disparity is computed. Local methods compute the disparity from the relationship between pixel
intensities (grayscale, RGB colours, texture patterns) inside a particular local support window. Sum of
absolute differences (SAD), sum of squared differences (SSD), and normalised cross-correlation (NCC) are
some examples of local matching costs. A local method estimates the disparity by comparing the surroundings of a pixel p in the
left image to the surroundings of a pixel q in the right image, where q has been translated by a candidate
disparity d, as illustrated in Figure 3. Computation of this kind is usually fast. However, the quality suffers,
especially in depth discontinuity areas [28], [29].
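
The window comparison described above can be sketched with SAD as the matching cost. This is a minimal illustration in plain Python and is not the CNN-based cost used in the proposed method; the function names, the nested-list image layout, and the border clamping are assumptions made for the example.

```python
def sad_cost(left, right, x, y, d, radius=1):
    """SAD between the window around (x, y) in the left image and the
    window around (x - d, y) in the right image (borders are clamped)."""
    height, width = len(left), len(left[0])
    total = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy = min(max(y + dy, 0), height - 1)
            xl = min(max(x + dx, 0), width - 1)
            xr = min(max(x + dx - d, 0), width - 1)
            total += abs(left[yy][xl] - right[yy][xr])
    return total


def best_disparity(left, right, x, y, max_disp, radius=1):
    """Candidate disparity in [0, max_disp] with the lowest SAD cost."""
    costs = [sad_cost(left, right, x, y, d, radius) for d in range(max_disp + 1)]
    return min(range(len(costs)), key=costs.__getitem__)


# Synthetic pair: the left image is the right image shifted by 2 pixels.
row = [0, 10, 50, 20, 90, 30, 70, 40, 60, 5]
right_img = [row[:] for _ in range(3)]
left_img = [[0, 0] + row[:-2] for _ in range(3)]
print(best_disparity(left_img, right_img, x=5, y=1, max_disp=4))  # 2
```

Swapping the absolute difference for a squared difference gives SSD; the window radius trades robustness to noise against blurring at depth discontinuities.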

Figure 2. A sequential multistage stereo vision model [1]

Figure 3. Matching windows in matching cost computation [30]


Global optimisation strategies treat disparity estimation as the minimisation of a predefined global
energy function. A number of solutions have been built on a Markov random field (MRF). These
methods may be categorised as either graph cut (GC) or belief propagation (BP) methods. The BP method
lowers the energy function by iteratively passing messages from the current node to neighbouring nodes in the MRF
network [31], [32]. Since the introduction in [33]-[35] of convolutional neural networks (CNN) trained on small image
patch pairs with known ground-truth disparity, interest in deep learning-based stereo vision systems has grown
significantly. CNN outperforms traditional approaches in terms of error rate and processing time, but it remains
challenging to identify the correct corresponding points in inherently ill-posed areas, such as areas with
repetitive shapes, occluded areas, reflective surfaces, and textureless areas. This study presents a stereo
correspondence method that utilises a deep learning-based hybrid approach to generate stereo image features for
computing the matching cost and an EPF to construct the final disparity map. CNN is used to extract
features from the image pairs and compute the matching cost, whereas BF and WTA aggregate the
costs and compute the disparities.


2. PROPOSED METHOD 

Figure 4 shows the proposed approach. A CNN is first employed to
extract information from the image pair and compute the similarity measures. In cost aggregation, a BF that preserves
edges while suppressing noise in the raw costs is used. The WTA technique then computes the disparity by selecting,
for each pixel, the disparity value with the minimum aggregated cost. A left-right consistency check is performed to
separate valid from invalid pixels in the image; it exposes occluded zones and
erroneous pixels, particularly in low-texture regions. In the filling-in step, invalid disparity map pixels
are replaced with valid values. In the last step, an EPF [36] is applied to remove any residual noise
introduced by the filling-in procedure.
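
The left-right check and the filling-in step can be sketched as follows. The tolerance of one pixel and the fill rule (take the smaller of the nearest valid neighbours on the same row, which favours the background) are common choices assumed here for illustration; the paper does not state which rule it uses, and the function names are hypothetical.

```python
def lr_consistency(disp_left, disp_right, tolerance=1):
    """Mark a left-image pixel invalid (None) when the right disparity map
    disagrees with it at the matched position x - d."""
    checked = []
    for row_l, row_r in zip(disp_left, disp_right):
        out = []
        for x, d in enumerate(row_l):
            xr = x - d
            ok = 0 <= xr < len(row_r) and abs(d - row_r[xr]) <= tolerance
            out.append(d if ok else None)
        checked.append(out)
    return checked


def fill_invalid(row):
    """Replace None entries with the smaller of the nearest valid values
    to the left and right (assumed rule; 0 if the row has no valid pixel)."""
    filled = list(row)
    for x, d in enumerate(row):
        if d is not None:
            continue
        left = next((row[i] for i in range(x - 1, -1, -1) if row[i] is not None), None)
        right = next((row[i] for i in range(x + 1, len(row)) if row[i] is not None), None)
        candidates = [v for v in (left, right) if v is not None]
        filled[x] = min(candidates) if candidates else 0
    return filled


checked = lr_consistency([[2, 2, 2, 5, 2]], [[2, 2, 2, 2, 2]])
print(checked)                   # [[None, None, 2, None, 2]]
print(fill_invalid(checked[0]))  # [2, 2, 2, 2, 2]
```

In the example, the first two pixels are rejected because their match falls outside the right image, and the outlier disparity 5 is rejected for the same reason; all three are then filled from valid neighbours.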


2.1. Matching cost computation 

The convolution layer is a crucial component of a CNN, comprising a sequence of square-shaped
kernels. Despite their small size, these filters extend through the whole depth of the input volume. When a convolution
layer is constructed, the depth of its kernels corresponds to the number of filters employed in the layer
immediately before it. Each kernel reads an input region, totals up the dot product, and saves the output in an
activation map. Next, the feature maps are stacked to construct the input volume for the subsequent
network layer. Rectified linear unit (ReLU), a non-linear activation function, is added after each convolution
layer. ReLU keeps the computation required to operate a neural network from growing exponentially: the
computation cost of introducing new ReLU layers to a growing CNN increases only linearly. The pooling layer gradually
decreases the dimension of the input, reducing the number of parameters and the amount of processing, and helps to prevent
overfitting. The matching cost can then be determined directly from the CNN output.
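
To make the three building blocks concrete, the sketch below implements a single-channel convolution (a dot product per window), ReLU, and max pooling in plain Python. It is only an illustration of the operations described above, not the network of [34]; the kernel values, sizes, and function names are assumptions.

```python
def conv2d(image, kernel):
    """Valid convolution (cross-correlation) of a 2D image with a square
    kernel: each output value is the dot product of the kernel and a window."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]


def relu(feature_map):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[-1, 0], [0, 1]]          # responds to the diagonal gradient
features = relu(conv2d(image, kernel))
print(features)                     # [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
print(max_pool(features))           # [[5]]
```

A real layer applies many such kernels, each spanning all input channels, and stacks the resulting feature maps along the depth axis.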


Figure 4. Block diagram of the proposed method (block labels recoverable from the figure: Left Image, Right Image, LR Consistency, Invalid Pixel, Disparity Map)


2.2. Cost aggregation 

Cost aggregation is designed to reduce matching ambiguity by applying a filter that smooths out
the heavy noise in the preliminary raw matching cost. It is required because the information collected for a
single pixel when computing the matching cost is insufficient for accurate matching. Bitwise operations on
binary strings were utilised to construct the aggregated cost volume in [37]. This approach is quick since it
requires little computation, but its accuracy is low because similar binary strings in different
regions of interest are treated alike. Another method is segmentation, proposed in [38], where a segment
tree (ST) was utilised in which pixels were sorted by reference colour and intensity into distinct segments.
This approach yielded precise results for the textured sections, but poor precision for the plain-colour and
non-textured parts. In this work, BF is used to boost the precision at object edges and to minimise the noise
inside the edges. The BF kernel is given by (2):


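$$ BF[I]_p = \frac{1}{W_p} \sum_{q \in S} G_{\sigma_s}(\lVert p - q \rVert) \, G_{\sigma_r}(\lvert I_p - I_q \rvert) \, I_q \qquad (2) $$


where W_p is the normalisation factor (the sum of the weights over the neighbourhood S of p), G_{\sigma_s} is the 'space'
Gaussian, which reduces the influence of distant pixels, and G_{\sigma_r} is the 'range' Gaussian, which reduces the
influence of a pixel q whose intensity differs from I_p.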
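
A direct, unoptimised reading of (2) is sketched below for a single-channel image stored as nested lists. The window radius and the two sigma values are arbitrary example parameters, and applying the filter to an intensity image rather than to each slice of the cost volume is a simplification assumed for illustration.

```python
import math


def bilateral_filter(image, radius=2, sigma_s=2.0, sigma_r=10.0):
    """Brute-force bilateral filter following Eq. (2)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weighted_sum = 0.0
            weight_total = 0.0  # W_p, the normalisation factor
            for qy in range(max(0, y - radius), min(height, y + radius + 1)):
                for qx in range(max(0, x - radius), min(width, x + radius + 1)):
                    dist2 = (qx - x) ** 2 + (qy - y) ** 2
                    diff2 = (image[qy][qx] - image[y][x]) ** 2
                    spatial = math.exp(-dist2 / (2.0 * sigma_s ** 2))  # G_sigma_s
                    rng = math.exp(-diff2 / (2.0 * sigma_r ** 2))      # G_sigma_r
                    weight = spatial * rng
                    weighted_sum += weight * image[qy][qx]
                    weight_total += weight
            out[y][x] = weighted_sum / weight_total
    return out


# A step edge survives filtering because pixels across the edge get
# near-zero range weights.
step = [[0, 0, 0, 100, 100, 100] for _ in range(3)]
smoothed = bilateral_filter(step)
print(round(smoothed[1][2], 3), round(smoothed[1][3], 3))  # 0.0 100.0
```

The centre pixel always contributes a weight of one, so W_p is never zero and the division is safe.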


2.3. Disparity computation 
Disparity computation selects, for every pixel, one disparity from the set of candidates based on the
aggregated cost. In this stage, WTA is implemented, as expressed in (3):


$$ d_p = \arg\min_{d \in D} C(p, d) \qquad (3) $$


where d_p is the disparity with the lowest cost, C(p, d) is the aggregated cost calculated in the second
stage, and D is the set of all candidate disparities, whose maximum value is chosen according to the
ground-truth disparity map. Because local approaches aggregate over support areas by summing or
averaging them, their precision is susceptible to noise and ambiguous areas.
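
The WTA rule in (3) amounts to an arg-min over the disparity axis of the cost volume. The sketch below assumes the volume is stored as cost_volume[d][y][x], a layout chosen only for this example.

```python
def winner_take_all(cost_volume):
    """Per-pixel arg-min over the disparity axis of cost_volume[d][y][x].
    Ties are resolved in favour of the smallest disparity."""
    num_disp = len(cost_volume)
    height = len(cost_volume[0])
    width = len(cost_volume[0][0])
    return [[min(range(num_disp), key=lambda d: cost_volume[d][y][x])
             for x in range(width)]
            for y in range(height)]


# Three candidate disparities for a 1 x 2 image.
volume = [
    [[3, 1]],   # costs at d = 0
    [[2, 5]],   # costs at d = 1
    [[9, 0]],   # costs at d = 2
]
print(winner_take_all(volume))  # [[1, 2]]
```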


2.4. Disparity refinement

In the last stage, post-processing and disparity refinement are carried out, for which several standard techniques exist. Peng
et al. [39] integrated mean shift and superpixels with a segmentation method (SEG). This
approach clusters the disparity map based on colour. The weighted median filter (WMF) is used in [40].
WMF combines box aggregation with a weighted median and effectively removes outlier noise
while preserving the edges. The median filter (MF) was utilised in [41], [42]. Its precision was good
around the edges, but MF generates a significant amount of error in regions with poor texture. BF was
utilised in [43] to improve the edge quality, at the cost of a longer processing time. The proposed
technique finishes with several consecutive steps: occlusion handling, invalid pixel
handling, and noise reduction. To remove artefacts, an EPF is applied that is based on a
transform defining an isometry between curves on the 2D image manifold in 5D and the real line,
and that performs high-quality edge-preserving filtering on images. The transform preserves the geodesic
distance between points on these curves, warping the input signal adaptively so that 1D filtering can be
carried out efficiently in linear time.
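
The description above matches the domain transform family of edge-preserving filters. Under the assumption that the EPF of [36] is of this kind, a 1D recursive variant can be sketched as follows: neighbouring samples are pushed apart in the transformed domain in proportion to the local gradient, so a simple recursive smoother stops at edges. The parameter values and function name are illustrative, and the signal is assumed to lie roughly in [0, 1].

```python
import math


def domain_transform_filter_1d(signal, sigma_s=10.0, sigma_r=0.2, iterations=3):
    """1D edge-preserving smoothing with a recursive filter applied in a
    warped domain where sample spacing grows with the local gradient."""
    n = len(signal)
    # Warped distance between samples i-1 and i: 1 + (sigma_s / sigma_r) * |I'|.
    dist = [1.0 + (sigma_s / sigma_r) * abs(signal[i] - signal[i - 1])
            for i in range(1, n)]
    out = [float(v) for v in signal]
    for it in range(iterations):
        # Per-iteration standard deviation, halved at each pass.
        sigma_h = (sigma_s * math.sqrt(3.0) * 2.0 ** (iterations - it - 1)
                   / math.sqrt(4.0 ** iterations - 1.0))
        a = math.exp(-math.sqrt(2.0) / sigma_h)
        for i in range(1, n):              # left-to-right pass
            w = a ** dist[i - 1]
            out[i] += w * (out[i - 1] - out[i])
        for i in range(n - 2, -1, -1):     # right-to-left pass
            w = a ** dist[i]
            out[i] += w * (out[i + 1] - out[i])
    return out


noisy_step = [0.0, 0.02, -0.01, 0.01, 0.0, 1.0, 0.98, 1.01, 0.99, 1.0]
filtered = domain_transform_filter_1d(noisy_step)
# The small fluctuations are smoothed while the jump between index 4 and 5 remains.
print([round(v, 3) for v in filtered])
```

Each pass touches every sample once, which is where the linear running time comes from; a 2D image would be processed by alternating such passes along rows and columns.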


3. RESULTS AND DISCUSSION 

The experiment was run on a Windows 10 machine with an Intel Xeon 2.80 GHz CPU, an Nvidia Quadro
P1000 GPU, and 8 GB of DDR4 RAM. Accuracy is assessed with the Middlebury v3 benchmark,
in which the training images are scored by the percentage of erroneous pixels over all pixels, including occluded
ones (ALL), and over non-occluded pixels only (NON-OCC). The deep learning component retains the parameters
used in the experiment by [34]. We
explored different approaches, including bitwise operations on binary strings and segmentation-based
methods such as ST, to smooth the matching cost and enhance precision. By considering support zones and
utilising techniques such as BF [44], we improved the matching results, especially at object edges and in textured
regions. We employed aggregated cost calculations and selected the disparity with the lowest cost for each
pixel. While local approaches are susceptible to noise and ambiguity, our method accounts for these
limitations by considering neighbouring pixel information, achieving substantial improvements in the final
disparity map accuracy. Table 1 compares the ground truth from Middlebury with the output
disparity map from the proposed system, along with the error rate.
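
The error measure described above can be sketched as a bad-pixel percentage. The 2.0-pixel threshold is an assumption for this example (the text does not state the threshold used), as are the function name and the convention that None marks pixels without ground truth.

```python
def bad_pixel_rate(estimate, ground_truth, non_occluded_mask=None, threshold=2.0):
    """Percentage of evaluated pixels whose disparity error exceeds threshold.
    With a mask, only non-occluded pixels are evaluated (NON-OCC);
    without one, every pixel that has ground truth is evaluated (ALL)."""
    bad = 0
    total = 0
    for y, row in enumerate(ground_truth):
        for x, gt in enumerate(row):
            if gt is None:
                continue  # no ground truth at this pixel
            if non_occluded_mask is not None and not non_occluded_mask[y][x]:
                continue  # occluded pixel, skipped for NON-OCC
            total += 1
            if abs(estimate[y][x] - gt) > threshold:
                bad += 1
    return 100.0 * bad / total if total else 0.0


est = [[10, 12, 20, 30]]
gt = [[10, 10, 10, 30]]
mask = [[True, True, False, True]]
print(bad_pixel_rate(est, gt))        # 25.0  (ALL)
print(bad_pixel_rate(est, gt, mask))  # 0.0   (NON-OCC)
```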


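Table 1. The ground truth of Middlebury v3 dataset compared to our algorithm output
[Image grid not reproduced. For each Middlebury v3 training image the table showed the input image, the ground truth, our disparity map ("Ours"), and our NON-OCC error. Surviving image labels: Adiron, ArtL, Jadepl, Motor, MotorE, PianoL, Shelvs. Surviving NON-OCC values: 8.05%, 8.14%, 4.49%, 2.93%, 12.10%, 3.16%, 6.88%.]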

The quantitative evaluation results are shown in Table 2 for the NON-OCC error and in Table 3 for the ALL
error. In Table 2, the proposed method achieves its lowest NON-OCC errors on the recycle bin and
teddy bear images, with 2.93% and 3.16%, respectively. Complicated images such as jadeplant and
shelves exhibit the worst accuracy, at 22.10% and 12.10%, owing to the system's inability to resolve
repetitive shapes, such as the tiny leaf bundles in the jadeplant image. Other areas that are difficult to match
include textureless objects and shadows, as in the ArtL and recycle images. These regions contain similar
pixel values, so the likelihood of a mismatch is high. Nevertheless, the proposed technique reconstructs a
nearly exact disparity map with distinct discontinuities. The disparity levels are assigned correctly where the
depth contours are clearly distinguishable. The proposed approach reduces salt-and-pepper noise while
preserving the object boundaries. Tables 2 and 3 list several published techniques for comparison with
the proposed work. The Middlebury evaluation shows that the proposed
stereo correspondence approach produces accurate results, with an average NON-OCC error of 6.71%. It
demonstrates that the proposed approach is comparable to recently published techniques and can be
implemented as a complete algorithm.


Table 2. Comparative results for NON-OCC error (%) using Middlebury v3 evaluation platform

Method           Avg Err  Adiron   ArtL     Jadepl   Motor    Piano    Pipes    Playrm   Playt    PlaytP   Recyc    Shelvs   Teddy    Vintge
SGM [45]         5.29     1.85     4.25     14.60    2.76     3.61     5.63     4.24     15.70    2.67     2.31     7.78     1.50     13.90
Ours             6.71     4.04     8.05     22.10    4.43     4.22     8.14     5.53     5.02     4.49     2.93     12.10    3.16     6.88
ELAS [46]        7.65     5.66     3.15     29.10    3.15     4.41     6.07     6.37     16.90    2.70     5.82     10.70    2.24     16.70
PSMNet_ROB [47]  9.60     7.32     9.69     44.50    5.55     5.01     9.86     7.33     4.40     4.43     3.73     11.10    3.44     8.07
BSM [37]         13.4     7.27     11.40    30.50    6.67     10.80    10.50    12.50    24.40    12.80    7.42     16.40    4.88     32.80


Table 3. Comparative results for ALL error (%) using Middlebury v3 evaluation platform

Method           Avg Err  Adiron   ArtL     Jadepl   Motor    Piano    Pipes    Playrm   Playt    PlaytP   Recyc    Shelvs   Teddy    Vintge
SGM [45]         8.51     2.46     7.83     32.1     5.17     5.12     11.3     6.15     18.5     3.59     2.84     8.35     2.53     15.5
ELAS [46]        10.9     6.68     S3       51.2     5.36     4.99     11.1     8.97     18.7     3.56     6.43     11.1     2.99     17.1
Ours             12.5     8.08     13.10    53.40    9.00     5.56     16.20    8.92     9.50     7.47     5.76     13.30    4.79     10.00
PSMNet_ROB [47]  13.3     8.83     13.9     68.4     8.26     5.89     14.4     9.38     5.54     5.52     4.98     11.6     3.87     9.66
BSM [37]         23.5     12.7     28.7     58.7     14.8     16       24.5     29.4     31       20.2     12.1     19.2     14.3     39.3


4. CONCLUSION 

Our research in stereo vision has led to the development of a complete method that
combines matching cost computation using a CNN, cost aggregation to reduce ambiguity, disparity
computation through WTA optimisation, and an EPF as a post-processing technique for refinement. The
method achieved accurate disparity estimation by addressing noise, occlusion, and texture-related
challenges. Comparative evaluations demonstrated its competitive performance, with an average NON-OCC
error of 6.71% on the Middlebury benchmark. Future research should focus on addressing the
limitations associated with repetitive shapes and textureless regions, and on improving the robustness of disparity
estimation in challenging scenarios. Continued refinement of the proposed method can contribute to the
ongoing progress in stereo vision and its applications, ultimately enhancing the understanding and utilisation
of 3D perception.